{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LlamaParse With MongoDB\n",
    "\n",
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/demo_mongodb.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "In this notebook, we provide a straightforward example of using LlamaParse with MongoDB Atlas VectorSearch.\n",
    "\n",
    "We illustrate the process of using llama-parse to parse a PDF document, then index the document with a MongoDB vector store, and subsequently perform basic queries against this store.\n",
    "\n",
    "This notebook is structured similarly to quick start guides, aiming to introduce users to utilizing llama-parse in conjunction with a MongoDB Atlas VectorSearch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index llama-parse\n",
    "%pip install llama-index-vector-stores-mongodb llama-index-llms-openai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup API Keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.environ[\n",
    "    \"LLAMA_CLOUD_API_KEY\"\n",
    "] = \"\"  # Get it from https://cloud.llamaindex.ai/api-key\n",
    "os.environ[\"OPENAI_API_KEY\"] = \"\"  # Get it from https://platform.openai.com/api-keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# llama-parse is async-first, running the sync code in a notebook requires the use of nest_asyncio\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "import requests\n",
    "import pymongo\n",
    "\n",
    "from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch\n",
    "from llama_parse import LlamaParse\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.core import VectorStoreIndex, StorageContext\n",
    "from llama_index.core.node_parser import SimpleNodeParser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download Document\n",
    "\n",
    "We will use `Attention is all you need` paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Download complete.\n"
     ]
    }
   ],
   "source": [
    "# The URL of the file you want to download\n",
    "url = \"https://arxiv.org/pdf/1706.03762.pdf\"\n",
    "# The local path where you want to save the file\n",
    "file_path = \"./attention.pdf\"\n",
    "\n",
    "# Perform the HTTP request\n",
    "response = requests.get(url)\n",
    "\n",
    "# Check if the request was successful\n",
    "if response.status_code == 200:\n",
    "    # Open the file in binary write mode and save the content\n",
    "    with open(file_path, \"wb\") as file:\n",
    "        file.write(response.content)\n",
    "    print(\"Download complete.\")\n",
    "else:\n",
    "    print(\"Error downloading the file.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse the document using `LlamaParse`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id 09a49745-9f21-4190-9de8-27e4e1a4bdf5\n"
     ]
    }
   ],
   "source": [
    "documents = LlamaParse(result_type=\"text\").load_data(file_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rmer - model architecture.\n",
      "The Transformer follows this overall architecture using stacked self-attention and point-wise, fully\n",
      "connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\n",
      "respectively.\n",
      "3.1   Encoder and Decoder Stacks\n",
      "Encoder:     The encoder is composed of a stack of N = 6 identical layers. Each layer has two\n",
      "sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\n",
      "wise fully connected feed-forward network. We employ a residual connection [11] around each of\n",
      "the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is\n",
      "LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer\n",
      "itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\n",
      "layers, produce outputs of dimension dmodel = 512.\n",
      "Decoder:    The decoder is also composed of a stack of N = 6 identical layers. In addition \n"
     ]
    }
   ],
   "source": [
    "# Take a quick look at some of the parsed text from the document:\n",
    "print(documents[0].get_content()[10000:11000])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create `MongoDBAtlasVectorSearch`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mongo_uri = os.environ[\"MONGO_URI\"]\n",
    "\n",
    "mongodb_client = pymongo.MongoClient(mongo_uri)\n",
    "mongodb_vector_store = MongoDBAtlasVectorSearch(mongodb_client)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "node_parser = SimpleNodeParser()\n",
    "\n",
    "nodes = node_parser.get_nodes_from_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create Index and Query Engine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "storage_context = StorageContext.from_defaults(vector_store=mongodb_vector_store)\n",
    "\n",
    "index = VectorStoreIndex(\n",
    "    nodes=nodes,\n",
    "    storage_context=storage_context,\n",
    "    embed_model=OpenAIEmbedding(),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine(similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "***********New LlamaParse+ Basic Query Engine***********\n",
      "The BLEU score on the WMT 2014 English-to-German translation task is 28.4.\n"
     ]
    }
   ],
   "source": [
    "query = \"What is BLEU score on the WMT 2014 English-to-German translation task?\"\n",
    "\n",
    "response = query_engine.query(query)\n",
    "print(\"\\n***********New LlamaParse+ Basic Query Engine***********\")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "We varied the learning\n",
      "rate over the course of training, according to the formula:\n",
      "               lrate = d−0.5                                                                          (3)\n",
      "                         model · min(step_num−0.5, step_num · warmup_steps−1.5)\n",
      "This corresponds to increasing the learning rate linearly for the first warmup_steps training steps,\n",
      "and decreasing it thereafter proportionally to the inverse square root of the step number. We used\n",
      "warmup_steps = 4000.\n",
      "5.4   Regularization\n",
      "We employ three types of regularization during training:\n",
      "                                                    7\n",
      "---\n",
      "Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the\n",
      "English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.\n",
      "        Model                                          BLEU               Training Cost (FLOPs)\n",
      "                                                 EN-DE      EN-FR          EN-DE         EN-FR\n",
      "        ByteNet [18]                              23.75\n",
      "        Deep-Att + PosUnk [39]                               39.2                       1.0 · 1020\n",
      "        GNMT + RL [38]                            24.6       39.92        2.3 · 1019    1.4 · 1020\n",
      "        ConvS2S [9]                               25.16      40.46        9.6 · 1018    1.5 · 1020\n",
      "        MoE [32]                                  26.03      40.56        2.0 · 1019    1.2 · 1020\n",
      "        Deep-Att + PosUnk Ensemble [39]                      40.4                       8.0 · 1020\n",
      "        GNMT + RL Ensemble [38]                   26.30      41.16        1.8 · 1020    1.1 · 1021\n",
      "        ConvS2S Ensemble [9]                      26.36      41.29        7.7 · 1019    1.2 · 1021\n",
      "        Transformer (base model)                  27.3       38.1               3.3 · 1018\n",
      "        Transformer (big)                         28.4       41.8                2.3 · 1019\n",
      "Residual Dropout       We apply dropout [33] to the output of each sub-layer, before it is added to the\n",
      "sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the\n",
      "positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of\n",
      "Pdrop = 0.1.\n",
      "Label Smoothing        During training, we employed label smoothing of value ϵls = 0.1 [36]. This\n",
      "hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.\n",
      "6    Results\n",
      "6.1   Machine Translation\n",
      "On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big)\n",
      "in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0\n",
      "BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is\n",
      "listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model\n",
      "surpasses all previously published models and ensembles, at a fraction of the training cost of any of\n",
      "the competitive models.\n",
      "On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0,\n",
      "outperforming all of the previously published single models, at less than 1/4 the training cost of the\n",
      "previous state-of-the-art model. The Transformer (big) model trained for English-to-French used\n",
      "dropout rate Pdrop = 0.1, instead of 0.3.\n",
      "For the base models, we used a single model obtained by averaging the last 5 checkpoints, which\n",
      "were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We\n",
      "used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters\n",
      "were chosen after experimentation on the development set. We set the maximum output length during\n",
      "inference to input length + 50, but terminate early when possible [38].\n",
      "Table 2 summarizes our results and compares our translation quality and training costs to other model\n",
      "architectures from the literature.\n"
     ]
    }
   ],
   "source": [
    "# Take a look at one of the source nodes from the response\n",
    "print(response.source_nodes[0].get_content())"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "anthropic_env",
   "language": "python",
   "name": "anthropic_env"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  },
  "vscode": {
   "interpreter": {
    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
